BMC Bioinformatics — Latest Matching Preprints

1

SEMFA: A General Framework for Inferring Statistical Significance of Mahalanobis Similarity between Multi-Omics Profiled Samples Built on Multiple Factor Analysis

Han, J.; Luo, W.; Baldwin, E.; Zhang, H. H.; An, L.; Liu, J.; Li, H.

2026-06-24 bioinformatics 10.64898/2026.06.18.733287 medRxiv

Top 0.1%

30.4%

Motivation: With rapid advances in sequencing technologies, many heterogeneous omics datasets have been generated, as seen in the Encyclopedia of DNA Elements (ENCODE) and many single-cell multi-omics sequencing projects, bringing substantial challenges to existing integrative methods. In this article, we report SEMFA, a novel multi-omics fusion and analysis software that performs general parametric tests for the Mahalanobis Similarity of samples based on the factor scores generated by an Extended version of conventional Multiple Factor Analysis. Results: Our method is effective and robust under both Gaussian and non-Gaussian assumptions. In simulation studies based on ENCODE count data, the mean F1 scores are over 0.8 when the column similarity level is 0.9 and the noise level ranges between 0.1 and 0.2. It is also efficient and effective at handling large-scale single-cell multi-omics data, as demonstrated in colon cancer cases, where it unveiled signature network organization patterns of cells for stages III and IV.
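As a rough illustration of the quantity this abstract builds on, the sketch below computes the classical Mahalanobis distance between two factor-score vectors. It is not SEMFA: the abstract does not say how the similarity is derived from the distance or how the parametric test is constructed, so the function name, the example vectors, and the covariance matrix are all assumptions made for illustration.

```python
import numpy as np

def mahalanobis_distance(x, y, cov):
    """Classical Mahalanobis distance sqrt((x - y)^T cov^{-1} (x - y)).

    x, y : 1-D factor-score vectors (hypothetical inputs).
    cov  : covariance matrix of the factor scores (assumed known here).
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff instead of forming the explicit inverse.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))

# With an identity covariance the distance reduces to the Euclidean distance.
d_identity = mahalanobis_distance([3.0, 4.0], [0.0, 0.0], np.eye(2))

# A larger variance along the first axis shrinks differences along that axis.
d_scaled = mahalanobis_distance([2.0, 0.0], [0.0, 0.0], np.diag([4.0, 1.0]))
```

Solving the linear system rather than inverting the covariance is a common numerical choice; it does not change the definition.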

2

Bamsnap-LRS: an automated batch visualization tool for long-read sequencing alignments

Chen, W.; Yang, C.; Qiu, L.; Hu, J.; Zhou, Y.

2026-06-25 bioinformatics 10.64898/2026.06.21.733121 medRxiv

Top 0.3%

14.9%

Summary: Long-read sequencing (LRS) has become essential for genome assembly, structural variation (SV) detection, haplotype phasing, and transcript isoform characterization. However, these applications often require manual inspection of read alignments for validation. Existing visualization tools are either interactive genome browsers that are difficult to scale to large datasets or batch-oriented tools that are not optimized for the unique alignment patterns of long-read data. We developed Bamsnap-LRS, an automated command-line tool for high-throughput LRS alignment visualization. It supports long-read-specific features, phased SNP inspection, and publication-ready batch figure generation within a unified framework for genomic, transcriptomic, and haplotype-aware analyses. Availability and Implementation: All code and examples are freely available at https://github.com/comery/Bamsnap-LRS.

3

A foundation model enables prediction of natural product molecular properties, bioactivity, and structural similarity from biosynthetic gene cluster sequence

Walker, A.

2026-07-07 bioinformatics 10.64898/2026.07.05.736569 medRxiv

Top 0.3%

13.4%

Genome mining is a powerful technique in natural product discovery, where biosynthetic gene clusters that are likely to produce novel or desirable natural products are identified through bioinformatic analysis. There are many more predicted biosynthetic gene clusters than can easily be experimentally characterized. Additional computational methods to prioritize biosynthetic gene clusters by the bioactivity, structural properties, or novelty of the product would make genome mining more efficient. Multiple machine learning/artificial intelligence models have been developed to predict product properties from biosynthetic gene cluster sequence, but they are limited by small quantities of training data. Model pretraining with unlabeled data is a powerful technique to develop models that can learn on a limited amount of labeled training data. Biosynthetic gene clusters are well suited to this strategy because there are many predicted clusters with only a small percentage being characterized. This paper reports BGC-MLM, a foundation model that is pretrained with a masked language task on predicted biosynthetic gene clusters and then fine-tuned for downstream applications including prediction of product structural class, bioactivity, chemical properties, counts of functional groups, and chemical fingerprint. Comparison to a model trained without pretraining shows that pretraining generally improves performance. BGC-MLM shows better or similar performance to existing specialized methods for these tasks, demonstrating its utility as a foundation model for natural product genome mining.

4

CNSigs: An R Package for the Identification of Copy Number Mutational Signatures

Tallman, D.; Striker, S.; Byappanahalli, A. M.; Stockard, S.; Jenison, J.; Collier, K. A.; Blige, E.; Vater, M.; Stover, D. G.

2026-06-25 bioinformatics 10.64898/2026.06.21.733646 medRxiv

Top 0.3%

13.2%

Background: Copy number aberrations (CNAs) are gains and losses of large genomic segments present across most cancer types and are a hallmark of cancer genomic alterations. However, the processes underlying CNAs and characteristic patterns of CNAs are poorly understood. Bioinformatic advances have identified underlying single nucleotide variant (SNV) mutational signatures resulting from distinct mutational processes, yet development of algorithms able to uncover similar signatures for CNAs remains less advanced. Methods: Using segmented data files from DNA sequencing, six copy number features are extracted for signature determination: segment size, breakpoints per 10 megabases, copy number oscillation events, average changepoint size, average copy number, and breakpoints per chromosome arm, along with ploidy. Mixed model approaches and non-negative matrix factorization (NMF) are utilized to derive CNA signatures across cancer types. The full methodology was packaged in a robust, publicly available R package termed CNSigs. Results: To verify the reproducibility of the signatures, we derived five signatures from two independent breast cancer datasets (total n>3000), demonstrating high accuracy (average cosine similarity = 0.89). Pan-cancer application of CNSigs in the TCGA dataset resulted in derivation of 13 pan-cancer signatures which were significantly associated with disease-specific survival. Benchmarking CNSigs against two other CNA signature approaches within TCGA demonstrated non-overlapping signatures and favorable compute speed for CNSigs. We evaluated n=24 pairs of tumor and circulating tumor DNA (ctDNA) acquired at the same time and demonstrated that CNSigs are detectable and reproducible via ctDNA, with significant association of CNSig11 with metastatic triple-negative breast cancer progression-free survival for taxane but not platinum or capecitabine chemotherapy. CNSigs association with immunophenotype was evaluated in low-grade glioma (LGG), and CNSig3 was found to be highly prognostic for LGG yet complementary to immune features. Conclusions: The CNSigs R package allows researchers to easily analyze their own samples to derive copy number signatures and evaluate clinical associations. We demonstrate potential application in ctDNA and association with treatment response. The development of this package allows further investigation of underlying processes that may be responsible for these CNA fingerprints.

5

GBZ-base and GAF-base: Indexed pangenome file formats

Siren, J.; Paten, B.; the Human Pangenome Reference Consortium

2026-07-11 bioinformatics 10.64898/2026.07.10.737775 medRxiv

Top 0.4%

12.8%

Motivation: Existing pangenome file formats are designed for batch processing. Graphs must be loaded into memory, and alignment files must be read sequentially. Indexed file formats that can be used directly from disk would be more appropriate for interactive applications. Results: We propose GBZ-base and GAF-base, SQLite-backed file formats comparable to GBZ and GAF. GBZ-base supports efficient extraction of local subgraphs, and GAF-base lets us extract all alignments to the subgraph. Additionally, GAF-base is smaller than any other file format for sequence-to-graph alignments. Availability and implementation: Available from https://github.com/jltsiren/gbz-base and https://crates.io/crates/gbz-base under the MIT license.

6

Beyond infinite sites: Generalized ABBA-BABA statistic for deeper phylogenies

Zhang, C.; Nielsen, R.

2026-07-08 bioinformatics 10.64898/2026.07.06.736715 medRxiv

Top 0.4%

12.4%

Patterson's D statistic detects gene flow from ABBA-BABA site patterns, but its biallelic site patterns fail under deeper divergences where multiple hits cause false positives. We propose two extensions, D+ and D*. Both incorporate multiallelic site patterns to reduce saturation bias under the JC and F84 models. Simulations show that D+ and D* both remain correctly null under all conditions and detect gene flow effectively, with distinct advantages: D+ guarantees non-negativity of the denominator, while D* provides greater robustness when mutation rates vary across genomic regions. The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER.
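For readers unfamiliar with the baseline, here is a minimal sketch of the classical biallelic Patterson's D, D = (nABBA - nBABA) / (nABBA + nBABA), computed from four aligned sequences ordered (P1, P2, P3, outgroup). It does not implement D+ or D*, whose definitions are not given in the abstract; the function name and the toy alignment are assumptions for illustration only.

```python
def patterson_d(p1, p2, p3, outgroup):
    """Classical Patterson's D from four aligned sequences of equal length.

    A site counts as ABBA when P1 matches the outgroup and P2 matches P3,
    and as BABA when P1 matches P3 and P2 matches the outgroup, with the
    two alleles differing in both cases. All other sites are ignored.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if a == o and b == c and a != b:
            abba += 1
        elif a == c and b == o and a != b:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (abba - baba) / (abba + baba), abba, baba

# Toy alignment (hypothetical): two ABBA sites, one BABA site, two uninformative sites.
d, n_abba, n_baba = patterson_d("AGAAT", "GACAT", "GGCAC", "AAAAC")
```

Because only sites with exactly two alleles in these configurations are counted, repeated substitutions at deep divergences can distort the counts, which is the limitation the preprint targets.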

7

Binary search and set operations on compacted k-mer lists

Dufresne, Y.; Andreace, F.

2026-07-03 bioinformatics 10.64898/2026.06.29.735436 medRxiv

Top 0.5%

12.2%

Sorted lists of elements are particularly good for computing set operations. A single scan of the two lists is sufficient to materialize or count the results of the union, intersection, difference, and xor operators. In bioinformatics, only a few tools are designed to perform these operations on k-mers. A fast tool like KMC allows set operations at the cost of storing individual k-mers. In this paper, we introduce a novel way to represent sorted k-mers as a collection of recomposed super-k-mer sorted lists. We introduce the concept of virtual super-k-mer and show how to construct, query and perform set operations on sorted lists of virtual super-k-mers. In the implementation sklib, we demonstrate high throughput of the data structure for construction and set operations, while remaining competitive in query capabilities, within a controlled memory footprint (2-5x decrease in bits/element compared to KMC).
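The single-scan idea in the first two sentences can be sketched as a two-pointer merge. This example works on plain sorted, duplicate-free lists of k-mer strings; it does not model super-k-mers, virtual super-k-mers, or sklib, and every name in it is hypothetical.

```python
def merge_set_op(a, b, keep_a_only, keep_b_only, keep_both):
    """One linear scan over two sorted, duplicate-free lists.

    The three flags select which elements are emitted: those only in a,
    those only in b, and those present in both.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            if keep_a_only:
                out.append(a[i])
            i += 1
        elif b[j] < a[i]:
            if keep_b_only:
                out.append(b[j])
            j += 1
        else:
            if keep_both:
                out.append(a[i])
            i += 1
            j += 1
    # At most one of the two tails is non-empty once the loop ends.
    if keep_a_only:
        out.extend(a[i:])
    if keep_b_only:
        out.extend(b[j:])
    return out

def union(a, b):
    return merge_set_op(a, b, True, True, True)

def intersection(a, b):
    return merge_set_op(a, b, False, False, True)

def difference(a, b):
    return merge_set_op(a, b, True, False, False)

def xor(a, b):
    return merge_set_op(a, b, True, True, False)

kmers_a = ["AAC", "ACG", "CGT"]
kmers_b = ["ACG", "GTA"]
```

Counting instead of materializing only requires replacing the appends with a counter; the scan itself stays the same.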

8

pylimma: a faithful, AnnData-native Python port of R limma for differential expression analysis

Mulvey, J.

2026-07-10 bioinformatics 10.64898/2026.07.06.736732 medRxiv

Top 0.5%

11.8%

pylimma is a faithful Python port of limma, intended to bring one of the most widely used tools for differential expression analysis to the developing Python ecosystem for transcriptomics and proteomics. We validated pylimma against the existing R implementation through 227 function-level comparisons and across six real-world datasets spanning microarray, RNA-seq, proteomics, and single-cell transcriptomics. pylimma reproduces limma's numerical output to a median agreement of 13 significant figures and calls identical sets of differentially expressed features and gene sets. This supports its use as a drop-in replacement for the R implementation.

9

PACMOS: an R package for Projection And Classification of Multi-Omic Samples

Kalson, L.; Sexton-Oates, A.; Drevet, G.; Fernandez-Cuesta, L.; Foll, M.; Alcala, N.

2026-07-07 bioinformatics 10.64898/2026.07.01.735542 medRxiv

Top 0.6%

11.4%

Motivation: Integrated multi-omic analyses have transformed our understanding of cancer biology, giving rise to data-driven molecular classifications that capture disease heterogeneity beyond conventional histopathology. Among these approaches, multi-omic factor analysis (MOFA), a multimodal extension of principal component analysis, has been widely used to identify sources of molecular variation across omic layers and classify samples into molecular groups. However, classifying query samples according to an existing MOFA-based classification remains challenging, as there is no validated computational method for projecting samples into pretrained MOFA latent factor spaces. Results: We present PACMOS, an R package that provides a generalizable approach to project query samples into pretrained MOFA latent factor spaces. We validate PACMOS using two cancer datasets with published MOFA-based classifications (lung neuroendocrine neoplasms and pleural mesothelioma), showing that PACMOS preserves the existing MOFA latent factor space while allowing query samples to be classified. Availability and implementation: PACMOS is an open-source R package available on the IARC bioinformatics GitHub organization (submitted to Bioconductor) at https://github.com/IARCbioinfo/PACMOS and archived on Zenodo (https://doi.org/10.5281/zenodo.20933824), along with installation instructions and a vignette with an application. Supplementary information: Supplementary data are available in separate files.

10

PARROT: Phase-Altering Regulatory Rewiring Over Time

Chen, C.; Padi, M.; Quackenbush, J.

2026-06-28 bioinformatics 10.64898/2026.06.24.734262 medRxiv

Top 0.7%

9.8%

Motivation: Gene regulatory networks undergo dynamic restructuring during development and disease. Identifying when and how these networks change is crucial for understanding developmental and disease transitions, yet existing change-point detection methods often ignore network structure or lack interpretable community assignments. Results: We present PARROT (Phase-Altering Regulatory Rewiring Over Time), a framework for detecting change-points in dynamic networks using Stochastic Block Models. PARROT jointly estimates change-point locations and community structure across four network classes: unipartite and bipartite with either Gaussian or Bernoulli edge models. Simulations demonstrate improved performance and community recovery compared to other methods. Applications to human cardiac differentiation and mouse lung development data successfully recovered known phase boundaries. PARROT identifies both which genes are reassigned across modules and how the connections change between states. Availability: PARROT is available as an R package at https://github.com/cchen22/PARROT. Contact: chenchen9945@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online.

11

High-Quality Predicted Pathway Annotations Greatly Improve Pathway Enrichment Analysis of Metabolomics Datasets

Huckvale, E. D.; Thompson, P. T.; Flight, R. M.; Moseley, H. N. B.

2026-07-08 systems biology 10.1101/2025.11.18.689105 medRxiv

Top 0.7%

9.6%

Background/Objectives: Metabolism-level interpretation of metabolomics datasets requires aggregation analyses across metabolites. One highly used aggregation analysis is pathway enrichment analysis (PEA), which involves detecting pathways enriched with metabolites that are differential between experimental groups. Annotating metabolites with pathway associations is a prerequisite for PEA. While several knowledgebases define pathways and include metabolite-pathway annotations, these definitions are often partially or even grossly incomplete due to limitations in current metabolic knowledge and its curation, which greatly limits the effectiveness of PEA. Methods: In this work, we used a novel multitask classification, graph convolutional-like neural network to generate high-quality metabolite-pathway annotations for pathways defined across KEGG, MetaCyc, and Reactome. We then included these predicted metabolite-pathway annotations when performing PEA on 990 datasets deposited in Metabolomics Workbench. Results: We demonstrate an 8-fold increase in the median number of enriched pathways detected across these datasets compared to using only knowledgebase-derived annotations. Conclusions: The significant increase in enriched pathways substantially improves the biological and biomedical interpretability of metabolomics datasets.

12

A systematic analysis of machine learning pipelines for robust antimicrobial resistance prediction

Aselstyne, A.; Karthik, E. N.; El Azami, M.; Pogorelcnik, R.; Fournier, Q.; Chandar, S.

2026-07-08 bioinformatics 10.64898/2026.06.28.734076 medRxiv

Top 0.8%

9.6%

Motivation: Antimicrobial resistance (AMR) has been identified as a top global public health threat. Accurate AMR phenotype prediction from whole-genome sequencing data is an essential tool for accelerating clinical decision-making and mitigating resistance spread. Although many previous works have explored the use of tree-based machine learning (ML) models to predict resistance, the field lacks a systematic evaluation of the training pipeline across a variety of pathogenic species and antibiotics. Results: Using nine clinically relevant species-antibiotic combinations from the NCBI antimicrobial susceptibility testing database, we present a detailed analysis of the ML pipeline and identify key factors affecting model performance and evaluation. We begin by relabelling all isolates using current CLSI minimum inhibitory concentration breakpoints to resolve inconsistencies and increase available data, resulting in up to a 19% label swap and 56% data enlargement per species-antibiotic combination. We identify several key training parameters including k-mer length, which can increase classification F1 scores by over 20 points compared to commonly used k-values, feature matrix truncation, which can induce polynomial time reductions with limited performance reduction, and ML model class. By comparing 5-fold cross-validation with evaluation on an unseen clinical dataset, we show that random cross-validation splits--often criticized as overly optimistic--can act as a strong proxy for downstream clinical performance, yielding closer F1 scores than phylogeny-aware splits in all cases. We finally present an interpretability study which shows that over 95% of k-mers used by our models are associated with identifiable genomic features. Our results highlight the importance of feature design, evaluation protocol, and biological analysis in genomic AMR prediction, and support tree-based models as a robust and interpretable method.

13

trAIt: Species-by-Trait Data Retrieval using Large Language Models

Balaji, S.; Martinson, K. A.; Schellenberger, J. S.; Koley, J.; Inman, C. M.; Hofmann, H. A.; Young, R. L.; Harpak, A.

2026-06-24 bioinformatics 10.64898/2026.06.19.732660 medRxiv

Top 0.8%

9.0%

Biological research often requires information about species traits. Manual literature collation can be time-consuming and miss parts of the literature. To address this gap, we developed trAIt, a publicly available software for the retrieval of characteristics of species from scientific literature catalogued in the Europe PubMed Central (PubMed) database. trAIt provides a graphical user interface (GUI) in which users specify species and characteristics of interest. Leveraging a large language model (LLM), trAIt retrieves relevant papers, combines their content through a consensus-based summarization model, and outputs a species-by-characteristic table. For a case study involving frog species, trAIt recovered 47.1% of trait-species combinations in 2.75 hours, while an expert curator independently recovered 62.4% over months. The consensus-based summarization substantially aids accuracy compared to single-source extraction. Across three case studies of vertebrate taxa, an expert confirmed the accuracy of 70.9% of trait-species entries recovered by trAIt. We observed considerable variation across taxa in trAIt's accuracy, which is possibly due to heterogeneity in open-access literature availability and inconsistencies in species and trait terminology. In sum, our analysis suggests that LLM-based tools can accelerate biological data synthesis but should be used to support domain experts' research, rather than replace their judgment.

14

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 0.9%

8.1%

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures have to be determined minimizing the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjacencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and practical perspective. Firstly, this simplification means that the full state space is not searched for a solution, but rather only the subset of genomes with the precomputed putative adjacencies, meaning an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, when gene annotation is often unreliable, especially in regions with a lot of rearrangement. Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to be able to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost to indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results that allow us to reduce the solution space that the ILP searches without sacrificing optimality. We show that this makes the problem tractable for the case of small and recently related genomes, first on simulated genomes, and then on a set of pathogenic plasmids which represent a realistic use case for the method.

15

Confounding effects of inferring gene co-expression networks from pooled data from different biological populations

Runghen, R.; Eliassi-Rad, T.; Bolnick, D. I.

2026-06-29 bioinformatics 10.64898/2026.06.23.734063 medRxiv

Top 0.9%

8.0%

Weighted Gene Co-expression Network Analysis (WGCNA) is routinely applied to pooled datasets from multiple biological populations, genotypes, or treatment groups, implicitly assuming a shared module structure across groups. While the distortion of pairwise correlations by pooling heterogeneous groups is well established statistically, three aspects of this problem have received little systematic attention in the context of co-expression network analysis: the extent to which pooling disrupts the discrete module-level community structure inferred by WGCNA; whether this disruption is detectable from the global topology metrics researchers routinely report; and how prevalent the pooling practice is in published multi-group WGCNA studies. Using analytical toy examples and a four-scenario simulation framework, we address all three questions. Module preservation Zsummary scores declined progressively with between-population divergence, from full preservation under identical populations (mean median Zsummary = 25.2 ± 3.3, 95% interval 19.0–30.7 across 20 simulation replicates) to substantial disruption when both network structure and mean expression differed (mean median Zsummary = 11.9 ± 1.0, 95% interval 10.2–13.5). This disruption was undetectable from global topology metrics: modularity and clustering coefficient remained stable across all scenarios, while edge density was sensitive but non-specific. These findings were corroborated in an empirical reanalysis of divergent lake and stream stickleback transcriptomes, where merged analysis collapsed 26 lake-specific and 59 stream-specific modules into only 19 merged modules. A survey of 100 publications found that 78.7% (95% CI 69.4–87.9%) of multi-group WGCNA studies with sufficient methodological reporting used a single merged analysis. Results were robust across network sizes of 250–1,000 genes and rewiring rates of 10–50%. We provide concrete recommendations including module preservation testing in both directions, population-specific baseline networks, and consensus WGCNA as a principled alternative.

16

Weak form Scientific Machine Learning for Systems Biology: A Tutorial on WENDy

Heitzman-Breen, N.; Lyons, R.; Jain, P.; Jolly, M. K.; Bortz, D. M.

2026-07-09 systems biology 10.64898/2026.07.02.735880 medRxiv

Top 0.9%

8.0%

Mechanistic ordinary differential equation models are widely used in systems biology to represent biochemical networks, population dynamics, cell-state transitions, and other biological processes; however, their predictive value depends critically on accurate parameter estimation from noisy and often sparse experimental data. In this tutorial, we present the Weak-form Estimation of Nonlinear Dynamics (WENDy) method as a forward-solver-free approach that reformulates parameter estimation as a covariance-corrected weak-form regression problem by integrating the model equations against compactly supported test functions. We present the background on the methodology through the lens of the familiar logistic equation, and we demonstrate applications of the method on real experimental data through two systems biology examples: a glycolytic oscillator with relatively dense time-course data and a sparse epithelial-mesenchymal cell-state transition model with multiple experimental replicates. Ultimately, using WENDy, we estimate interpretable biological parameters with uncertainty for systems with noisy and sometimes sparse available experimental data.
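To make the weak-form idea concrete, the sketch below fits the logistic equation u' = r u (1 - u/K) = a u + b u^2 (with a = r, b = -r/K) without calling an ODE solver. Multiplying by a compactly supported test function φ and integrating by parts gives -∫φ' u dt = a ∫φ u dt + b ∫φ u^2 dt, which is linear in (a, b). This is plain weak-form least squares on noise-free synthetic data; it omits the covariance correction that characterizes WENDy, and the test-function shape, centers, width, and all names are assumptions chosen for illustration.

```python
import numpy as np

def bump(t, center, width, p=4):
    """Polynomial test function (1 - s^2)^p on |s| < 1 (zero elsewhere) and its derivative."""
    s = (t - center) / width
    inside = np.abs(s) < 1.0
    phi = np.where(inside, (1.0 - s**2) ** p, 0.0)
    dphi = np.where(inside, -2.0 * p * s * (1.0 - s**2) ** (p - 1) / width, 0.0)
    return phi, dphi

def weak_form_logistic_fit(t, u, centers, width):
    """Estimate (r, K) by least squares on the weak form of the logistic ODE.

    Assumes a uniform time grid and test functions supported inside [t[0], t[-1]],
    so a plain Riemann sum coincides with the trapezoid rule.
    """
    dt = t[1] - t[0]
    rows, rhs = [], []
    for c in centers:
        phi, dphi = bump(t, c, width)
        rows.append([np.sum(phi * u) * dt, np.sum(phi * u**2) * dt])
        rhs.append(-np.sum(dphi * u) * dt)  # integration by parts moves the derivative onto phi
    (a, b), *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return a, -a / b  # r = a, K = -a / b

# Synthetic noise-free data from the closed-form logistic solution (r = 1, K = 10, u0 = 1).
t = np.linspace(0.0, 10.0, 1001)
u = 10.0 / (1.0 + 9.0 * np.exp(-t))
r_hat, K_hat = weak_form_logistic_fit(t, u, centers=np.linspace(2.0, 8.0, 13), width=2.0)
```

Because the derivative falls on the smooth test function rather than on the data, no numerical differentiation of u is needed, which is what makes the approach attractive for noisy measurements.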

17

Large-scale analysis of optimisation methods for parameter estimation problems in the life sciences

Grein, S.; Penas, D. R.; Weindl, D.; Lakrisenko, P.; Banga, J. R.; Hasenauer, J.

2026-07-13 systems biology 10.64898/2026.07.11.737731 medRxiv

Top 0.9%

8.0%

Dynamic models are central to the computational life sciences but typically contain unknown parameters that must be inferred from experimental data. High-throughput measurements have made this task increasingly challenging, yielding high-dimensional search spaces and non-convex objectives with many local optima. This makes the choice of optimisation method critical. However, existing empirical studies either consider only a limited number of benchmark problems or only a narrow spectrum of local, global and hybrid optimisation methods. Here, we present a comprehensive benchmark of a broad range of optimisation methods on a curated collection of parameter estimation problems, comprising 990 method-problem pairs executed on two independent supercomputing infrastructures. Our evaluation quantifies success rates, solution quality and computational cost, revealing characteristic strengths and limitations of each approach. We find that optimisation methods separate into clear performance tiers. Building on these results, we implemented a new hybrid strategy that combines enhanced scatter search with the best-performing local solver, which showed robust performance and improved on the other scatter-search variants we tested. Our results provide practical guidance for selecting optimisation methods and thereby support more accurate and reliable model calibration.

18

Beyond statistical significance: ranking transcription factor binding motifs by effect size

Viner, C.; Mastromatteo, S.; Denisko, D.; Negrea, J.; Tang, Y.; Zhang, L.; Hoffman, M. M.; Sun, L.

2026-06-24 bioinformatics 10.64898/2026.06.19.732679 medRxiv

Top 1.0%

7.9%

Chromatin immunoprecipitation-sequencing (ChIP-seq) has wide use in identifying transcription factor binding sites. DNA sequence motifs specific to a targeted transcription factor occur more frequently near ChIP-seq peak centres. The most common approach to quantifying relative motif enrichment ranks motifs by p-value. Because sample sizes can vary substantially across examined motifs, p-value magnitudes may reflect this heterogeneity rather than the biological effect of interest. As alternatives, we considered four ranking methods based on effect sizes: (a) a modified Cliff's delta, (b) the lower bound of a frequentist asymptotic confidence interval, (c) the lower bound of a frequentist finite-sample confidence interval, and (d) the lower bound of a Bayesian credible region. In extensive simulations, the four alternatives better recovered the simulated central-enrichment ordering under heterogeneous sample sizes. Using published ChIP-seq data for GATA3, the effect size methods ranked the known targeted motif highest, even compared to highly similar motifs for other GATA family members, while p-value ranking did not. In a separate SRF application, all four alternative methods also consistently ranked the known motif highest. We recommend the asymptotic confidence interval lower bound for its simplicity, ease of implementation, and intuitive interpretation. The software is freely available (https://github.com/ScottMastro/motif-ranking).
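As background for method (a), here is a minimal sketch of the standard, unmodified Cliff's delta, δ = (#{x > y} - #{x < y}) / (n·m), computed over all pairs from two samples. The modification used in the preprint is not described in the abstract, so this is only the textbook statistic; the function name and the toy samples are assumptions.

```python
import numpy as np

def cliffs_delta(x, y):
    """Standard Cliff's delta: P(x > y) - P(x < y) estimated over all pairs.

    Ranges from -1 (every x below every y) to +1 (every x above every y).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Broadcasting builds the n-by-m matrix of pairwise differences.
    diff = x[:, None] - y[None, :]
    greater = np.count_nonzero(diff > 0)
    less = np.count_nonzero(diff < 0)
    return (greater - less) / (x.size * y.size)

# Toy samples (hypothetical): 6 pairs with x > y, 1 pair with x < y, 2 ties.
delta = cliffs_delta([1, 2, 3], [0, 1, 2])
```

Unlike a p-value, the statistic does not shrink or grow with sample size by itself, which is the property the abstract appeals to when sample sizes differ across motifs.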

19

synpact: accurate, memory-light PacBio HiFi read mapping via a hierarchy of locally-consistent syncmer blocks

Aydin, M. S.; Sahlin, K.

2026-07-02 bioinformatics 10.64898/2026.06.28.735066 medRxiv

Top 1%

7.8%

Motivation: Mapping PacBio HiFi reads is a routine task and serves as a central step in many bioinformatics analyses. However, the most accurate long-read mappers have high memory consumption and are slow. Some lightweight mappers have been proposed for faster runtime, but their accuracy is not comparable to state-of-the-art mappers. With the increasing number of available reference sequences, memory-efficient and fast methods for read mapping without a large accuracy drop are desired. A general limitation of seed-chain-extend mappers is the need to select a single, fixed seed size, which forces a compromise between sensitivity and specificity. Results: We present synpact, a long-read mapper that uses several seed sizes (a hierarchy) constructed with Locally Consistent Parsing (LCP) over syncmers. A read is mapped by querying for matches at different levels, followed by sliding window voting. By storing only the coarse upper levels rather than the full hierarchy, the index holds several times fewer entries, while still handling errors by falling back from coarser to finer stored levels at query time. We benchmark synpact against popular long-read mappers on four genomes and different read lengths. For simulated PacBio HiFi data, synpact matches or approaches minimap2 accuracy with higher precision in most cases, while using roughly 5-13 times less peak memory (e.g., about 0.8 GB vs. 10.7 GB on human) and mapping faster on large or repetitive genomes (e.g., about 10 to 13 times faster than minimap2 on rye). On real HiFi reads, synpact has high concordance with minimap2 across the four genomes, as opposed to the other lightweight long-read mappers. Availability and Implementation: synpact is written in Rust and is available at https://github.com/mahmudsami/synpact

20

Comp2GPR: A Sequence-Driven Framework for Gene-Protein-Reaction Rule Reconstruction

Castillo, S.

2026-06-26 bioinformatics 10.64898/2026.06.24.734174 medRxiv

Top 1%

7.3%

Accurate gene-protein-reaction (GPR) associations are essential for the predictive performance of genome-scale metabolic models (GEMs), as they define the mapping between genes, enzymes, and metabolic reactions. However, GPR rules are often incomplete or inconsistent due to limitations in annotation transfer and the ambiguous representation of multi-subunit protein complexes, leading to errors in downstream analyses such as gene essentiality prediction. Here, I introduce Comp2GPR, an automated pipeline for reconstructing GPR rules that integrates curated protein complex information with sequence-level evidence. Protein complexes were sourced from the Complex Portal and subjected to an AI-assisted curation workflow to retain only metabolically relevant assemblies. Comp2GPR combines deterministic sequence similarity mapping with explicit rule construction to generate Boolean GPR expressions that accurately represent obligate subunit relationships and isoenzyme redundancy. I evaluated the impact of the reconstructed GPR rules by integrating them into the Yeast9 metabolic model and comparing gene essentiality predictions with the original model. While global performance metrics remained largely unchanged, the updated model achieved a net improvement in prediction accuracy through gene-level corrections. Overall, Comp2GPR demonstrates that combining curated protein complex data with sequence-based validation improves the accuracy, interpretability, and reproducibility of GPR rules. The method provides a robust framework for enhancing metabolic model annotations and supports more reliable simulation-based analyses.